December 17th, 2015

Bio

  • B.A. in Math & Economics, Saint John’s University (06-10)
  • Data Analyst, College Enrollment & Financial Aid Consulting Firm (10-11)
  • M.S. in Statistics, Iowa State University (11-13)
    • Thesis: Tools for Collecting and Analyzing MLB PITCHf/x Data (pitchRx).
  • Research Intern, Statistics Research Department at AT&T Labs (Summer '13)
    • Worked with Dr. Kenny Shirley on LDAvis and LDAtools.
  • Took PhD courses and passed written qualifying exam (13-14)
  • Student, Google Summer of Code (Summer '14)
    • Began work on animint
  • Teaching Assistant, Iowa State University (11-15)
  • Mentor, Google Summer of Code (Summer '15)
  • Software Developer, plotly (Summer '15 - Present)
  • Research Assistant, Monash University (Sept '15 - Present)

Proposal Overview

  • The importance of interface design
  • Interfaces for working with web content
  • Interfaces for acquiring data on the web
  • Dynamic interactive statistical web graphics
    • Why interactive?
    • Indirect versus direct manipulation
    • Linked views and pipelines
    • Web graphics
    • Translating R graphics to the web
    • R interfaces for interactive web graphics

Motivation

  • Why interactive & dynamic graphics? They help us:
    • Find high-dimensional, abstract relationships in data that may otherwise go unnoticed
    • Diagnose models by visualizing them in the data space (Wickham, Cook, & Hofmann 2015)
    • Explore and understand complicated fitted statistical model(s) (Example to follow)
    • Communicate/share our work with others in a compelling way
  • Why web-based?
    • simple to share, portable (web browser)
    • encourages composability
    • guide your audience by providing links to interesting selections/states

Latent Dirichlet Allocation (Blei, Ng, Jordan; 2003)

In the digital humanities (& elsewhere), LDA is often used to "discover topics" in a large collection of text documents. How are researchers supposed to interpret these topics?

Generative Model

  1. For each document \(d\), draw topic distribution \(\theta_d \sim Dir(\alpha)\)
  2. For each topic \(k\), draw term distribution \(\phi_k \sim Dir(\beta)\)
  3. Let \(N_d\) be # of words in doc \(d\) and \(n \in \{1, \dots, N_d\}\). For each word \(w_{d, n}\):
    • Draw a (latent) topic, \(z_{d, n} \sim Mult(1, \theta_d)\)
    • Draw a word given topic, \(w_{d, n} \sim Mult(1, \phi_{z_{d, n}})\)
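
The three steps above can be simulated directly. The following is a minimal base R sketch, not code from any of the packages mentioned in this talk; the vocabulary, the sizes (K, D, N), the hyperparameter values, and the helper rdirichlet1() are all made up for illustration.

```r
# Toy simulation of the LDA generative process (all sizes are illustrative).
set.seed(1)

# Hypothetical helper: one draw from Dirichlet(alpha), obtained by
# normalizing independent gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

K <- 3                       # number of topics
vocab <- c("ball", "pitch", "vote", "tax", "gene", "cell")
V <- length(vocab)
D <- 4                       # number of documents
N <- c(8, 10, 6, 12)         # N_d: number of words in each document
alpha <- rep(0.5, K)
beta  <- rep(0.1, V)

# Step 1: theta_d ~ Dir(alpha); one row per document
theta <- t(replicate(D, rdirichlet1(alpha)))
# Step 2: phi_k ~ Dir(beta); one row per topic
phi <- t(replicate(K, rdirichlet1(beta)))

# Step 3: for each word slot, draw a latent topic z, then a word given z
docs <- lapply(seq_len(D), function(d) {
  z <- sample.int(K, size = N[d], replace = TRUE, prob = theta[d, ])
  w <- vapply(z, function(k) sample.int(V, size = 1, prob = phi[k, ]), integer(1))
  data.frame(topic = z, word = vocab[w])
})

docs[[1]]
```

Fitting LDA reverses this process: only the words are observed, and the topic assignments, theta, and phi have to be inferred.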

Model fitting

  • Griffiths & Steyvers (2004) derive a collapsed Gibbs sampler. Implemented in R packages LDAtools (Shirley & Sievert, 2013) and lda (Chang, 2015).
  • A wide array of fitting algorithms is available in topicmodels (Grün & Hornik, 2011) and mallet (Mimno, 2013).

Towards topic interpretation

  • Each topic owns a different pmf over a set vocabulary.
  • Problem: We can't possibly examine each pmf. Where should we put our focus?
    • Numerous interactive systems allow users to select a topic \(z\), then list top ~30 words based on \(p(w | z)\) (Gardner et al., 2010; Chaney and Blei, 2012; Snyder et al., 2013).
    • But, words likely to occur overall are also likely to occur for a given topic!
    • Taddy (2011) proposed to rank terms by \(lift = p(w | z)/p(w)\)
    • But if \(p(w)\) is small, \(lift\) is large!
    • Bischof and Airoldi (2012) propose a new model that directly estimates each word's average frequency and its exclusivity to a given topic.
    • Sievert & Shirley (2014) propose choosing \(0 < \lambda < 1\), for: \[ \text{relevance}(\lambda) = \lambda * p(w|z) + (1 - \lambda) * \text{lift} \]
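
To see how the three rankings differ, here is a small base R sketch using the relevance formula exactly as written on this slide. The five terms and their probabilities are invented for illustration, and relevance() and rank_terms() are hypothetical helpers, not LDAvis functions. Note that the published LDAvis paper defines relevance on the log scale; this sketch follows the slide's simplified form.

```r
# Hypothetical values for five terms (from a larger vocabulary) and one topic z.
terms <- c("the", "game", "pitcher", "strikeout", "knuckleball")
p_w_z <- c(0.10, 0.08, 0.06, 0.04, 0.02)     # p(w | z)
p_w   <- c(0.12, 0.03, 0.01, 0.005, 0.001)   # p(w), marginal over all topics

lift <- p_w_z / p_w

# Relevance as written on the slide: a weighted mix of p(w | z) and lift.
relevance <- function(lambda) lambda * p_w_z + (1 - lambda) * lift

# Order terms from most to least relevant for a given lambda.
rank_terms <- function(lambda) {
  terms[order(relevance(lambda), decreasing = TRUE)]
}

rank_terms(1)    # lambda = 1: ranks purely by p(w | z)
rank_terms(0)    # lambda = 0: ranks purely by lift
rank_terms(0.5)  # an intermediate weighting
```

With these made-up numbers, ranking by p(w | z) puts the common word "the" first, while ranking by lift puts the rare word "knuckleball" first; lambda lets the user move between the two extremes.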

Who is using it?

  • People who use LDA and want a tool for interpreting topics.
  • Combined LDAvis and pyLDAvis currently have 356 stars on GitHub (a measure of popularity).
  • I know a number of consultants and industry workers using it for both exploration and presentation.
  • Dr. Grant Arndt in the Department of Anthropology at Iowa State University and his research assistant are using it as a research aid.
  • It enables users to extract and communicate insight derived from sophisticated statistical models.

We need better tools

  • Producing interactive and dynamic web graphics from "scratch" (i.e., using HTML/JavaScript/CSS/SVG/d3js) is powerful and flexible, but very time-consuming.
  • People doing data analysis & statistics don't have the time to learn all these tools. In general, how do we best enable them to create their own interactive dynamic web graphics?
  • I've worked on two R packages in this direction: animint and plotly.
  • Both can translate ggplot2 (Wickham, 2009) graphics to a web-based format (SVG/canvas) and add some basic interactive features.
  • Mention the userbase.

library(ggplot2)
p <- qplot(data = iris, x = Sepal.Width, y = Sepal.Length, color = Species)
p

library(plotly)
ggplotly(p)

library(animint)
animint2dir(list(plot = p))

Translating R graphics to the web

  • Pros:
    • Easy to use: builds on existing knowledge/code
    • Doesn't require a web server running special software
  • Cons:
    • Translation may depend on internals of other packages
    • To change something that's serialized, you need to re-run R code
    • Hard to extend, customize, and/or add (interactive) features
  • Translation is pragmatic, but a truly interactive web graphics tool needs a custom interface/language designed for that purpose.
  • Many relevant R packages provide bindings to JavaScript libraries through a JSON specification (e.g., ggvis (Chang & Wickham, 2015), rbokeh (Hafen & Bokeh team, 2015), plotly (Sievert & Plotly team, 2015))

R Bindings to JavaScript Libraries

  • General idea:
    • Start with an HTML/JS/CSS template
    • Abstract away data and layout/appearance options
    • Map a set of R objects to template
myWrapper <- function(...) {
  # compute stuff, then serialize the R objects for the JS template
  jsonlite::toJSON(list(...))
}
  • The R package htmlwidgets makes it easy for authors to write bindings that play nicely with shiny/rmarkdown/RStudio.
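
As a rough illustration of the "map R objects to a template" idea, the sketch below builds a small JSON specification with jsonlite. The function scatter_spec() and the shape of the spec are hypothetical; this is not the API of htmlwidgets, plotly, or any other package named here.

```r
library(jsonlite)

# Hypothetical wrapper: collect data and layout options into a single JSON
# spec that a JavaScript template could read and render.
scatter_spec <- function(x, y, title = NULL) {
  stopifnot(length(x) == length(y))
  spec <- list(
    data   = list(list(type = "scatter", x = x, y = y)),
    layout = list(title = title)
  )
  # auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays
  toJSON(spec, auto_unbox = TRUE)
}

json <- scatter_spec(x = c(1, 2, 3), y = c(2, 4, 8), title = "Toy example")
json
```

The R side only prepares and serializes data; drawing and interaction are left to the JavaScript library that consumes the spec.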

library(plotly)
plot_ly(z = volcano, type = "surface")

Talk about pipes!

p <- plot_ly(z = volcano, type = "surface")
str(p)
#> Classes ‘plotly’ and 'data.frame':   0 obs. of  0 variables
#>  - attr(*, "plotly_hash")= chr "d72417c2f38125f11112cd6591f06f2e#2"

str(plotly_build(p))
#> List of 4
#>  $ data          :List of 1
#>   ..$ :List of 3
#>   .. ..$ type      : chr "surface"
#>   .. ..$ z         : num [1:87, 1:61] 100 101 102 103 104 105 105 106 107 108 ...
#>   .. ..$ colorscale:'data.frame':    10 obs. of  2 variables:
#>   .. .. ..$ : num [1:10] 0 0.111 0.222 0.333 0.444 ...
#>   .. .. ..$ : Factor w/ 10 levels "#1F9D89","#26838E",..: 6 7 5 3 2 1 4 8 9 10
#>  $ layout        :List of 1
#>   ..$ zaxis:List of 1
#>   .. ..$ title: chr "volcano"

plot_ly(economics, x = date, y = uempmed, mode = "markers") %>%
  add_trace(y = fitted(forecast::Arima(uempmed, c(1,0,0))), mode = "lines") %>%
  subset(uempmed == max(uempmed)) %>%
  layout(annotations = list(x = date, y = uempmed, text = "Peak", showarrow = TRUE),
         title = "Median duration of unemployment (in weeks)", showlegend = FALSE)

Enabling coordinated, linked views

  • Coordinated, linked views are an important feature of any interactive statistical graphics system (e.g., cranvas, ggobi, iplots, mondrian, MANET).
  • In order to have linked views, we need a "data pipeline" (Buja et al., 1988; Wickham et al., 2010).
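
One way to picture a data pipeline is as shared selection state that notifies every view when it changes. The base R sketch below is only a toy under that assumption: make_pipeline(), add_view(), and select() are hypothetical names, and the "views" just record summaries instead of drawing plots. It is not how cranvas, ggobi, or plotly implement linking.

```r
# A tiny "pipeline": shared selection state plus views that react to it.
make_pipeline <- function(data) {
  selected  <- rep(FALSE, nrow(data))
  listeners <- list()
  list(
    # Register a view: any function of (data, selected)
    add_view = function(f) {
      listeners[[length(listeners) + 1]] <<- f
      invisible(NULL)
    },
    # Update the selection, then notify every registered view
    select = function(rows) {
      selected <<- seq_len(nrow(data)) %in% rows
      invisible(lapply(listeners, function(f) f(data, selected)))
    }
  )
}

pipe <- make_pipeline(iris)
seen <- new.env()  # where the toy views store what they would display

# View 1: how many points are selected
pipe$add_view(function(d, sel) seen$n <- sum(sel))
# View 2: species breakdown of the selected points
pipe$add_view(function(d, sel) seen$species <- table(d$Species[sel]))

# A "brush" in one view selects rows; both views update from the same state
pipe$select(which(iris$Sepal.Length > 7))
seen$n
seen$species
```

Because every view reads the same selection, a brush in one view is reflected in all of the others without the views knowing about each other.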

Things I'd like to work on

  • Easy:
    • Once plotly.js has native support for a "selection brush", allow users to access selections in shiny (just like the click example).
    • Interface for binding to plotly events with JavaScript callbacks (for users who do know JS).
  • Hard:
    • Interface for binding to plotly events without JavaScript callbacks (for users who don't know JS).

Timeline

  • December: Revise and resubmit book chapter on MLB Pitching Expertise and Evaluation for the Handbook of Statistical Methods for Design and Analysis in Sports, a volume that is planned to be one of the Chapman & Hall/CRC Handbooks of Modern Statistical Methods.
  • January: Revise and submit animint paper.
  • February: Linked views in plotly.
  • April: Write and submit curating data paper.
  • June: Write and submit interactive web graphics paper.
  • August: Thesis defense.

Thanks to my collaborators

  • LDAvis (Kenny Shirley)
  • animint (Toby Dylan Hocking, Susan VanderPlas, Kevin Ferris, and Tony Tsai)
  • plotly (Toby Dylan Hocking and the Plotly Team)